Introduction

300

My work occasionally involves documenting internal processes that use SQL. This not only requires me to have a solid understanding of the language but also to feel confident in querying and modifying data tables independently.

My goal here is to show you that I’m not completely helpless when it comes to playing with SQL. Since querying company data wouldn’t be appropriate, I’ll be using the next best thing: a dataset on the life expectancy of animals in US zoos!

In this personal project, I will be querying and modifying this dataset based on realistic demands of varying complexity.

Note

There will be no examples with JOIN clauses of any kind in this exercise. Unfortunately, I couldn’t find any dataset that I liked that had more than one table.

The dataset

I found a fairly comprehensive dataset that documents the life expectancy of 321 species in US zoos. It includes mammals, birds, and reptiles, but excludes fish. The dataset contains the following columns, among others:

  • Name: The name of the species

  • Scientific name: The scientific name of the species

  • Taxon class: The taxonomy of the species

  • Overall sample size: The total number of individuals of this species

  • Overall MLE: The overall life expectancy of this species

  • Male sample size: The number of male individuals of this species

  • Male MLE: The life expectancy of male individuals of this species

  • Female sample size: The number of female individuals of this species

  • female MLE: The life expectancy of female individuals of this species

  • Male data defecient: Whether the data is robust enough for the male individuals of this species

  • Female data defecient: Whether the data is robust enough for the female individuals of this species

Import the dataset into PostgreSQL

To import the dataset, which is a .csv file, into PostgreSQL, I’ll first need to create the table’s columns and specify their data type in pgAdmin.

CREATE TABLE LifeExpectancy (
    Name VARCHAR(100),
    ScientificName VARCHAR(200),
    TaxonClass VARCHAR(100),
	OverallSampleSize INT,
	OverallMLE NUMERIC (3,1),
	OverallCI_Lower NUMERIC (3,1),
	OverallCI_Upper NUMERIC (3,1),
	MaleSampleSize INT,
	MaleMLE NUMERIC (3,1),
	MaleCI_Lower NUMERIC (3,1),
	MaleCI_Upper NUMERIC (3,1),
    FemaleSampleSize INT,
	FemaleMLE NUMERIC (3,1),
	FemaleCI_Lower NUMERIC (3,1),
	FemaleCI_Upper NUMERIC (3,1),
	MaleDataDeficient BOOLEAN NULL,
	FemaleDataDeficient BOOLEAN NULL
	);

Once the dataset is imported, I can view in pgAdmin using a simple SELECT statement.

SELECT * FROM LifeExpectancy;
100%

All 321 animal species successfully imported.

Basic stuff, really

Let’s start with a few simple queries to better understand this dataset.

To begin, let’s identify the animal with the highest average life expectancy.

SELECT name, overallmle FROM LifeExpectancy
ORDER BY overallmle DESC
LIMIT 1;
100%

The Andean condor wins with 47 years on average in captivity!

Now, which animal has the shortest life expectancy on average?

SELECT name, overallmle FROM LifeExpectancy
ORDER BY overallmle ASC
LIMIT 1;
100%

Poor mossy leaf-tailed gecko! An average lifespan of 2.4 years sure is short.

Earlier, I mentioned that the dataset comprises three taxons: mammals, birds, and reptiles. We know that the Andean condor is the longest-living animal of the bunch and that it’s a bird. So, what are the longest-living mammals and reptiles in this dataset?

Let’s start with the mammals:

SELECT name, overallmle FROM LifeExpectancy
WHERE taxonclass = 'Mammalia'
ORDER BY overallmle DESC
LIMIT 3;
100%

Seems like the longest-living mammals are the Asian elephant, the chimpanzee, and the African elephant, in this exact order. I personally find it impressive that a bird can outlive two of the largest land-dwelling animals in the world, as well as our closest primate cousin.

Now onto the reptiles:

SELECT name, overallmle FROM LifeExpectancy
WHERE taxonclass = 'Reptilia'
ORDER BY overallmle DESC
LIMIT 3;
100%

The most resilient reptiles in this dataset are the Chinese alligator, the Rio Fuerte beaded lizard, and the Jamaican boa.

Now that we have a better understanding of this dataset, let’s move on to slightly more involved queries.

Aggregate statistics

While we know that the longest-living animal here is a bird, it is possible that it is an exception. On average, perhaps mammals or reptiles may live longer. Let’s find out. Let’s find out the average life expectancy of each animal group.

SELECT taxonclass, ROUND(AVG(overallmle),2) FROM LifeExpectancy
GROUP BY taxonclass
ORDER BY AVG(overallmle) DESC;
100%

Interesting! It seems that there were more animal groups in the dataset that I had thought. Turns out members of the cartilaginous fishes live the longest (no surprise here), while the amphibians have the shortest lifespans.

220
Who lives longer? Males or females?

The dataset includes average lifespans segmented by sex for each animal. We know that human females generally live longer than their male counterparts. But does this hold true across the rest of the animal kingdom? Do female animals outlive their male counterparts, or vice versa?

SELECT taxonclass, ROUND(AVG(malemle),2) AS male_avg, ROUND(AVG(femalemle),2) AS female_avg
FROM LifeExpectancy
GROUP BY taxonclass
ORDER BY ROUND(AVG(malemle),2) DESC;
100%

I guess the female sex lives longer only among mammals; males live longer on average across the remaining groups. We learn something new every day!

The middle value

Based on previous queries, the longest-living animal in this dataset is the andean condor with 47 years, and the shortest-living is a gecko that expires after only 2.4 years. Let’s find the animal that has the median life expectancy in this dataset.

SELECT name, overallmle FROM LifeExpectancy
ORDER BY overallmle
LIMIT 1 OFFSET (SELECT COUNT(*)/2 FROM LifeExpectancy);
100%

The North American river otter is our winner at 12 years. With a total of 321 animals in the dataset, this means there are exactly 160 animals that live longer and 160 that live shorter than it.

Range and variance

As for last aggregate statistics, let’s check out measures of range and variance across the taxons in the dataset.

The range is relatively elementary; it is simply the difference between the maximum value and the minimum value. In our context, it means the difference in life expectancy between the longest-living and the shortest-living animals of each taxon.

SELECT taxonclass, MAX(overallmle) - MIN(overallmle) AS life_expectancy_range
FROM LifeExpectancy
GROUP BY taxonclass
ORDER BY MAX(overallmle) - MIN(overallmle) DESC;
100%

According to our results, birds have the largest range in this dataset, while amphibians have the smallest. This means that some birds live a very long time, while others have a much shorter lifespan. Meanwhile, the life expectancy of most amphibians is more or less clustered.

A more "scientifically accurate" way of measuring how values are spread out is standard deviation, which calculates how much the values differ or deviate from the mean. In this context, the standard deviation of each animal group refers to how much the life expectancy of each animal within the group deviates from the average life expectancy of that group. A high standard deviation indicates that the life expectancies of the animals in that group vary widely from the average, suggesting a broader range of lifespans.

SELECT taxonclass,
	ROUND(STDDEV(overallmle), 2) AS life_expectancy_stddev,
	ROUND(AVG(overallmle), 2) AS avg_life_expectancy
FROM LifeExpectancy
GROUP BY taxonclass
ORDER BY ROUND(STDDEV(overallmle),2) DESC;
100%

The results of the cartilaginous fishes group stand out. Not only does this group have the highest average life expectancy, but it also has the highest standard deviation. This suggests that the number of cartilaginous fishes in the dataset is likely small, and that the longest-living species outlives the shortest-living by a significant margin.

Let’s take a deeper look at these fish in our dataset. First, let’s count the number of cartilaginoush fishes we have.

SELECT COUNT(*) FROM LifeExpectancy
WHERE taxonclass = 'Chondrichthyes';
100%

Only two species?! The previous results would make total sense if one of them lives significantly longer than the other. Let’s find the cartilaginous fishes.

SELECT name, overallmle FROM LifeExpectancy
WHERE taxonclass IN ('Chondrichthyes');
100%

As anticipated, the sand tiger shark has a considerably longer lifespan than the ocellated river stingray. Though they belong to the same taxon, they are vastly different animals.

Notable animals

Let’s throw in a few random queries with completely arbitrary conditions and see who the lucky winners are!

Long-living reptiles

Reptiles seem to be underperforming in our dataset, with their lifespans being relatively shorter than those of other animals. But surely there must be a few reptiles that defy this trend, right?

Which reptiles have a lifespan longer than the overall average in this dataset?

SELECT name,
	overallmle AS average_lifespand,
	(SELECT ROUND(AVG(overallmle),2) AS dataset_average_lifespan FROM LifeExpectancy)
FROM LifeExpectancy
WHERE taxonclass = 'Reptilia'
AND overallmle > (SELECT AVG(overallmle) FROM LifeExpectancy)
ORDER BY overallmle DESC;
100%
220

Six reptiles were found to be having above-average lifespans. Go alligators!

Similar lifespans

For fun, let’s find out which animals have the same lifespans. The goal here is to identify animal pairs having the same average life expectancy.

SELECT t1.name, t2.name, t1.overallmle FROM LifeExpectancy AS t1
INNER JOIN LifeExpectancy AS t2
ON t1.name != t2.name
AND t1.overallmle = t2.overallmle
AND t1.name < t2.name
ORDER BY t1.overallmle DESC;
100%

Wow, 242 rows! I was certainly not expecting that many distinct pairs of animals sharing the same lifespan. The chimpanzee lives just as long as the African elephant…​

Sexual longevity dimorphism

Earlier, we discussed the differences in longevity between males and females across various taxons. Now, I’m curious to know which specific animals exhibit the greatest lifespan differences between the sexes.

SELECT name,
    ABS(ROUND(AVG(malemle), 2) - ROUND(AVG(femalemle), 2)) AS longevity_difference
	malemle,
    femalemle
FROM LifeExpectancy
GROUP BY name, malemle, femalemle
ORDER BY longevity_difference DESC
LIMIT 5;
100%

Wow, the African elephant wins another entry. It’s actually quite tragic that the bulls perish at 24.1 years in captivity when their female counterparts can carry on for much longer.

Another thing worth mentioning here is that our previous finding on sexual longevity dimorphism remains consistent; mammals are the only group in which females live longer than males.

Extent of sexual longevity dimorphism across the dataset

In the previous section, we retrieved the animals that displayed the greatest degree of sexual longevity dimorphism. Now, we will attempt to provide a bird’s eye view of sexual longevity dimorphism across the entire dataset.

Specifically, we will categorize the extent of sexual longevity dimorphism for each animal as "significant", "some", or "negligible" based on the difference between the animal’s average male life expectancy and its average female life expectancy. The total number for each three category will then be tallied in a table.

WITH LifeExpectancyDifference AS (
    SELECT
        ABS(ROUND(AVG(malemle), 2) - ROUND(AVG(femalemle), 2)) AS life_expectancy_difference
    FROM LifeExpectancy
    GROUP BY name
)
SELECT
    sexual_longevity_dimorphism,
    COUNT(*) AS number_of_animals
FROM (
    SELECT
        CASE
            WHEN life_expectancy_difference >= 2 THEN 'Significant'
            WHEN life_expectancy_difference BETWEEN 1 AND 2 THEN 'Some'
            WHEN life_expectancy_difference < 1 THEN 'Negligible'
        END AS sexual_longevity_dimorphism
    FROM LifeExpectancyDifference)
GROUP BY sexual_longevity_dimorphism;
100%

Interesting; it seems that most animals display at the very least some degree of sexual longevity dimorphism. Only about one-third of animals display a negligible extent.

Outliers

In statistics, an outlier is a data point that is two standard deviations away from the mean, either above or below it. In the context of our exercise, an outlier would be an animal that lives, on average, much longer or much shorter than the overall average life expectation of all animals in the dataset.

Let’s find all the outliers in the dataset.

WITH GlobalStats AS (
    SELECT
        ROUND(AVG(overallmle),2) AS avg_life_expectancy,
	    ROUND(STDDEV(overallmle),2) AS stddev_life_expectancy
    FROM LifeExpectancy
)
SELECT name, overallmle, avg_life_expectancy, stddev_life_expectancy
FROM LifeExpectancy, GlobalStats
WHERE overallmle > (avg_life_expectancy + 2 * stddev_life_expectancy)
   OR overallmle < (avg_life_expectancy - 2 * stddev_life_expectancy)
ORDER BY overallmle DESC;
100%
250

No fewer than 15 animals were found to be outliers, and all skew longer lifespans. To verify that these animals are indeed outliers, we can sum the average life expectancy of all animals with the standard deviation x 2 and compare that number with the average life expectancy of a given outlier. This operation yields 28.41 years (13.35 + 7.53 x 2), which is smaller than the Sumatran orangutan's 28.7 years. Everything looks good!

Data-deficient animals

All the queries we’ve entered so far have not taken into account the quality of the data. As indicated in the dataset, some animals are deemed "data deficient." Additionally, the sample size of some others is small (< 30), which makes the resulting aggregate statistics, such as the average, unreliable.

Animals worth a damn

First, let’s compare the total number of animals in this dataset with the number of animals that have "good" data only.

SELECT taxonclass,
    COUNT(*) AS total_count,
    (SELECT COUNT(*)
     	FROM LifeExpectancy T2
    	WHERE T2.taxonclass = T1.taxonclass
       		AND T2.malesamplesize > 30
       		AND T2.femalesamplesize > 30
       		AND T2.maledatadeficient IS NULL
       		AND T2.femaledatadeficient IS NULL) AS good_data_count
FROM LifeExpectancy T1
GROUP BY taxonclass;
100%

The results show that mandating "good" data thins the dataset considerably.

Trimming the fat

It’s worth examining whether omitting the "bad" data changes the measurements in any way. For example, we can compare the "default" average life expectancy by taxon with the average life expectancy of animals that meet the "good" data criteria, also by taxon.

SELECT T1.taxonclass,
       ROUND(AVG(T1.overallmle), 2) AS overallmle,
       ROUND(
           (SELECT AVG(T2.overallmle)
            FROM LifeExpectancy T2
            WHERE T2.taxonclass = T1.taxonclass
              AND T2.malesamplesize > 30
              AND T2.femalesamplesize > 30
              AND T2.maledatadeficient IS NULL
              AND T2.femaledatadeficient IS NULL), 2) AS complete_overallmle
FROM LifeExpectancy T1
GROUP BY T1.taxonclass;
100%

The new averages are lower across the board. This means that the "bad" data had an inflationary effect on life expectancies.

Manipulating table data

Welcome to the home stretch. In this section, we’ll make some arbitrary changes to the dataset itself and call it a day.

Dropping data I don’t like

I noticed that there were some columns I had never touched, not even once, during the entirety of this exercise, namely those pertaining to confidence intervals.

100%

Because I like my data lean and clean, I’m deciding to drop these columns from the table.

ALTER TABLE LifeExpectancy
DROP COLUMN overallci_lower,
DROP COLUMN overallci_upper,
DROP COLUMN maleci_lower,
DROP COLUMN maleci_upper,
DROP COLUMN femaleci_lower,
DROP COLUMN femaleci_upper;
100%

Much better!

A new animal appears!

What’s that? An animal was omitted from the dataset? An arctic fox you say? We must add this cute creature back in the dataset in ASAP.

INSERT INTO LifeExpectancy(name,scientificname,taxonclass,overallsamplesize,
overallmle,malesamplesize,malemle,femalesamplesize,femalemle,maledatadeficient,femaledatadeficient)
VALUES ('Arctic fox','Vulpes lagopus','Mammalia',554,10.3,230,9.9,324,10.5,NULL,NULL)
100%

Success! Our adorable fox has been included in the dataset.